A unicorn company, or unicorn startup, is a private company with a valuation over $1 billion. As of March 2022, there are 1,000 unicorns around the world. Popular former unicorns include Airbnb, Facebook and Google. Variants include a decacorn, valued at over $10 billion, and a hectocorn, valued at over $100 billion.
Data Link: https://www.kaggle.com/datasets/deepcontractor/unicorn-companies-dataset
There are 1037 data rows.
## [1] "Company"
## [1] "Valuation...B."
## [1] "Date.Joined"
## [1] "Country"
## [1] "City"
## [1] "Industry"
## [1] "Select.Inverstors"
## [1] "Founded.Year"
## [1] "Total.Raised"
## [1] "Financial.Stage"
## [1] "Investors.Count"
## [1] "Deal.Terms"
## [1] "Portfolio.Exits"
The data set has 13 columns, listed above.
In the data cleaning process, we made three functions for convenience.
NA Value Check
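This function prints, for each column, the number of missing values. In this data set a missing value is stored as the string 'None' rather than as a true NA.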
numberOfNa = function(df){
  flag = 'None' # missing values are stored as the string 'None'
  for(i in seq_len(ncol(df))){
    temp = df[, i] # extract the columns one by one
    n = sum(temp == flag, na.rm = TRUE) # count the 'None' entries, ignoring any true NA
    print(paste(colnames(df)[i], n)) # print the column name and its 'None' count
  }
}
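The same per-column count can also be obtained without an explicit loop. The following is a minimal sketch on a made-up data frame (`toy` and its values are invented for illustration and are not part of the unicorn data set):

```r
# Toy data frame (invented for illustration); 'None' marks a missing value.
toy <- data.frame(
  Company      = c("A", "B", "C"),
  Founded.Year = c("2015", "None", "None"),
  Deal.Terms   = c("None", "3", "1"),
  stringsAsFactors = FALSE
)

# toy == "None" is a logical matrix; colSums() counts the TRUE entries per column.
noneCounts <- colSums(toy == "None", na.rm = TRUE)
print(noneCounts)
```

This returns a named vector, which is easier to reuse in later steps than printed text.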
Drop NA Value
This function drops every row whose value in the given column is 'None'.
dropNone = function(df, columnName){
  # Logical filter: still correct when the column contains no 'None' at all
  keep = !(df[, columnName] %in% 'None')
  df = df[keep, ]
  return(df)
}
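One pitfall to keep in mind when dropping rows by index in R: if `which()` finds no match, it returns an empty vector, and a negative empty index selects no rows at all instead of all of them. The sketch below shows this on a made-up data frame (`toy` is invented for illustration). The columns cleaned in this report do contain 'None', so the problem would only appear on a column that has none.

```r
# Toy data frame (invented for illustration) with no 'None' entries.
toy <- data.frame(x = c("1", "2", "3"), stringsAsFactors = FALSE)

drop <- which(toy$x == "None")          # integer(0): nothing matches
emptied <- toy[-drop, , drop = FALSE]   # -integer(0) is still integer(0), so no rows are selected

# A logical filter keeps every row when there is nothing to drop.
kept <- toy[!(toy$x %in% "None"), , drop = FALSE]

print(nrow(emptied))
print(nrow(kept))
```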
Data Type Check
This function prints the data type of each column.
checkType = function(df){
  for(i in seq_len(ncol(df))){
    temp = df[, i]
    print(paste(colnames(df)[i], '--->', typeof(temp)))
  }
}
The output below shows how many missing ('None') values there are in each column.
numberOfNa(df)
## [1] "Company 0"
## [1] "Valuation...B. 0"
## [1] "Date.Joined 0"
## [1] "Country 0"
## [1] "City 0"
## [1] "Industry 0"
## [1] "Select.Inverstors 17"
## [1] "Founded.Year 43"
## [1] "Total.Raised 24"
## [1] "Financial.Stage 988"
## [1] "Investors.Count 1"
## [1] "Deal.Terms 29"
## [1] "Portfolio.Exits 988"
First, drop the two columns with 988 missing values. Also drop the ‘Select.Inverstors’ column because it is redundant.
Second, drop the rows that have missing values. For example, ‘Founded.Year’ has 43 missing values, so we drop all 43 of those rows.
# Drop the two columns that are almost entirely 'None'
df = df[, !(colnames(df) %in% c('Financial.Stage', 'Portfolio.Exits'))]
# After inspection, Select.Inverstors is not suitable for analysis
# and not important here, so drop it as well
df = df[, !(colnames(df) %in% c('Select.Inverstors'))]
### Drop the rows with 'None' in each affected column
df = dropNone(df, 'Founded.Year') # drop 'None' in Founded.Year
df = dropNone(df, 'Deal.Terms') # drop 'None' in Deal.Terms
df = dropNone(df, 'Total.Raised') # drop 'None' in Total.Raised
The output is the data type of each column before any processing.
## [1] "Company ---> character"
## [1] "Valuation...B. ---> character"
## [1] "Date.Joined ---> character"
## [1] "Country ---> character"
## [1] "City ---> character"
## [1] "Industry ---> character"
## [1] "Founded.Year ---> character"
## [1] "Total.Raised ---> character"
## [1] "Investors.Count ---> character"
## [1] "Deal.Terms ---> character"
Before doing any analysis, we have to clean the data value in each column.
For “Valuation...B.”, convert the string to a number. E.g., “$5.3” --> 5.3
For “Total.Raised”, convert the string to a number and use millions of dollars as the column unit. E.g., “$7.44B” --> 7440
For “Date.Joined”, split the string into three new numeric columns: dayJoin, monthJoin, and yearJoin. E.g., “4/7/2017” --> 4, 7, 2017 in three separate columns.
For “Founded.Year”, “Deal.Terms”, and “Investors.Count”, convert the string to a number. E.g., “2019” --> 2019
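The conversions described above could be sketched as follows. The helper names (`parseValuation`, `parseRaised`, `splitDate`) are hypothetical and are not the functions used in this report. The handling of "M" and "K" suffixes is an assumption, since only the "B" case is shown in the examples.

```r
# Hypothetical helpers; the names are illustrative only.

# "$5.3" -> 5.3 : strip the leading dollar sign, then convert.
parseValuation <- function(x) {
  as.numeric(sub("^\\$", "", x))
}

# "$7.44B" -> 7440 : convert to millions of dollars.
# Assumption: values end in B, M, or K; anything else becomes NA.
parseRaised <- function(x) {
  suffix <- sub(".*([A-Za-z])$", "\\1", x)       # trailing unit letter
  value  <- as.numeric(gsub("[^0-9.]", "", x))   # numeric part only
  scale  <- c(B = 1000, M = 1, K = 0.001)        # multipliers to millions
  unname(value * scale[toupper(suffix)])
}

# "4/7/2017" -> 4, 7, 2017 : split on "/" and convert each piece.
# The column order follows the text above (dayJoin, monthJoin, yearJoin).
splitDate <- function(x) {
  parts <- do.call(rbind, strsplit(x, "/", fixed = TRUE))
  data.frame(
    dayJoin   = as.numeric(parts[, 1]),
    monthJoin = as.numeric(parts[, 2]),
    yearJoin  = as.numeric(parts[, 3])
  )
}

print(parseValuation("$5.3"))
print(parseRaised(c("$7.44B", "$500M")))
print(splitDate("4/7/2017"))
```

The plain year and count columns need no helper: `as.numeric("2019")` is enough once the 'None' rows are gone.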
No missing values remain in the data set.
## [1] "Company 0"
## [1] "Valuation...B. 0"
## [1] "Date.Joined 0"
## [1] "Country 0"
## [1] "City 0"
## [1] "Industry 0"
## [1] "Founded.Year 0"
## [1] "Total.Raised 0"
## [1] "Investors.Count 0"
## [1] "Deal.Terms 0"
## [1] "dayJoin 0"
## [1] "monthJoin 0"
## [1] "yearJoin 0"
Each column now has a data type suited to analysis.
## [1] "Company ---> character"
## [1] "Valuation...B. ---> double"
## [1] "Date.Joined ---> character"
## [1] "Country ---> character"
## [1] "City ---> character"
## [1] "Industry ---> character"
## [1] "Founded.Year ---> double"
## [1] "Total.Raised ---> double"
## [1] "Investors.Count ---> double"
## [1] "Deal.Terms ---> double"
## [1] "dayJoin ---> double"
## [1] "monthJoin ---> double"
## [1] "yearJoin ---> double"
After cleaning, the number of rows drops from 1037 to 962.
## [1] 962
The columns that remain after deleting the redundant columns and creating the new ones:
## [1] "Company"
## [1] "Valuation...B."
## [1] "Date.Joined"
## [1] "Country"
## [1] "City"
## [1] "Industry"
## [1] "Founded.Year"
## [1] "Total.Raised"
## [1] "Investors.Count"
## [1] "Deal.Terms"
## [1] "dayJoin"
## [1] "monthJoin"
## [1] "yearJoin"
Before
After
For ease of analysis, we’ve decided to only look at countries with more than 20 unicorn companies.
tmp <- as.data.frame(table(df$Country))
tmp <- tmp[tmp$Freq > 20,]
df <- df[df$Country %in% tmp$Var1,]
The majority of the startups’ valuations are clustered around $1 billion. This is almost certainly caused by the $1 billion cutoff at which a company is considered a unicorn. As a result, when viewing all the companies together, we see a large number of outliers. This raises the question of how many of these higher-valued companies would still be considered outliers if the dataset included companies valued under $1 billion. Still, we can separate the companies by country and look at the distributions of valuations through that lens.
Valuation separated by country.
Here, we see a similar but slightly different picture. Most notably, the countries with the most unicorn companies also tend to have more of these much higher-valued companies and, as a result, more outliers. Interestingly, these highly valued outliers are still too few to noticeably shift the median.
Next, let’s look further at these outliers by comparing each company’s valuation to the total money it has raised. Because so many of the companies are based in the United States, the US has an outsize influence on this regression line.
Interestingly, though, the countries with fewer companies seem to have companies that raise more money relative to their valuations. This raises some interesting questions: Are these very highly valued companies actually able to raise the funds they need to be successful? Are the unicorns from countries with fewer companies more likely to be successful, since they raise more money relative to their valuation? Or does this simply mean that many of the highly valued companies are overvalued?
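A comparison of this kind could be sketched as below. The data here is randomly generated for illustration (the `toy` data frame, its two countries, and the `Valuation` column name are all assumptions), so the numbers say nothing about the real companies. The sketch only shows the mechanics of fitting one pooled line and of comparing money raised relative to valuation by country.

```r
set.seed(1)
# Toy data (invented): Total.Raised in $ millions, Valuation in $ billions.
toy <- data.frame(
  Country      = rep(c("United States", "India"), times = c(40, 10)),
  Total.Raised = runif(50, min = 100, max = 2000)
)
toy$Valuation <- 1 + 0.002 * toy$Total.Raised + abs(rnorm(50, sd = 0.5))

# One regression line over all companies; the larger group dominates the fit.
pooled <- lm(Valuation ~ Total.Raised, data = toy)
print(coef(pooled))

# Median amount raised per $1 billion of valuation, by country.
raisedPerBillion <- tapply(toy$Total.Raised / toy$Valuation, toy$Country, median)
print(raisedPerBillion)
```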
The Central Limit Theorem states that, for a population with finite variance, the distribution of sample means approaches a normal distribution as the sample size grows, whatever the shape of the population itself. The sample means center on the population mean, and their spread shrinks as the sample size increases. Below are the distributions of the sample means from 1000 random samples at each of the sample sizes 10, 20, 30, and 40.
We saw earlier that the distribution of valuations was not normal and had a significant right skew. This skew can be seen quite clearly when the sample size is only 10, but we can see the skew essentially disappear once the sample size grows large enough.
## [1] "Sample Size = 10 Mean = 3.57208689822019 SD = 2.5457077984006"
## [1] "Sample Size = 20 Mean = 3.46541379098787 SD = 1.77535883477021"
## [1] "Sample Size = 30 Mean = 3.47917694964797 SD = 1.41146025849021"
## [1] "Sample Size = 40 Mean = 3.49287885883917 SD = 1.17044277917296"
## [1] "Total Size = 789 Mean = 3.46157160963245 SD = 7.96289444123421"
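The shrinking SD follows the usual rule that the standard deviation of a sample mean is roughly the population SD divided by the square root of the sample size. With the population SD of about 7.96 reported above, this gives about 2.52, 1.78, 1.45, and 1.26 for sample sizes 10, 20, 30, and 40, close to the simulated values. A simulation of this kind could be sketched as follows. The population here is synthetic log-normal data standing in for the valuation column, and drawing without replacement is an assumption.

```r
set.seed(42)
# Synthetic right-skewed "valuations" (log-normal); a stand-in for the real column.
population <- rlnorm(789, meanlog = 0.5, sdlog = 1)

sizes <- c(10, 20, 30, 40)
results <- data.frame(size = sizes, mean = NA_real_, sd = NA_real_)

for (i in seq_along(sizes)) {
  # 1000 sample means, each from a sample of the given size drawn without replacement
  sampleMeans <- replicate(1000, mean(sample(population, sizes[i])))
  results$mean[i] <- mean(sampleMeans)
  results$sd[i]   <- sd(sampleMeans)
}

print(results)
# Theoretical approximation for comparison: population SD / sqrt(sample size)
print(sd(population) / sqrt(sizes))
```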
Here, we use sampling to draw a representative subset of the population for further analysis. There are a variety of sampling methods; we have chosen a few and compare the subsets that each one creates.
Original Data
SRSWOR
Systematic Sampling
Stratified Sampling
## Stratum 1
##
## Population total and number of selected units: 143 14.49937
## Stratum 2
##
## Population total and number of selected units: 23 2.332066
## Stratum 3
##
## Population total and number of selected units: 23 2.332066
## Stratum 4
##
## Population total and number of selected units: 61 6.185044
## Stratum 5
##
## Population total and number of selected units: 39 3.954373
## Stratum 6
##
## Population total and number of selected units: 500 50.69708
## Number of strata 6
## Total number of selected units 80
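The per-stratum counts in the output above appear to be proportional allocations left unrounded: for example, 80 × 143 / 789 ≈ 14.499 for stratum 1. The three sampling methods can be sketched in base R as below. This is a sketch on a made-up data frame (`toy`, its strata, and the sample size of 20 are invented), it is not the code that produced the output above, and unlike that output it rounds the stratum sizes to whole numbers.

```r
set.seed(7)
# Toy population (invented): 200 units in three strata of unequal size.
toy <- data.frame(
  id      = 1:200,
  Country = rep(c("A", "B", "C"), times = c(120, 60, 20))
)
N <- nrow(toy)
n <- 20

# Simple random sampling without replacement (SRSWOR)
srswor <- toy[sample(N, n), ]

# Systematic sampling: every k-th row after a random start
k <- ceiling(N / n)
start <- sample(k, 1)
idx <- seq(start, by = k, length.out = n)
systematic <- toy[idx[idx <= N], ]

# Stratified sampling with proportional allocation, rounded to whole units
counts <- table(toy$Country)
sizes <- setNames(round(n * as.vector(counts) / N), names(counts))
stratified <- do.call(rbind, lapply(names(sizes), function(s) {
  rows <- which(toy$Country == s)
  toy[rows[sample.int(length(rows), sizes[[s]])], ]
}))

print(table(srswor$Country))
print(table(systematic$Country))
print(table(stratified$Country))
```

Stratified sampling guarantees that each stratum appears in proportion to its size, whereas SRSWOR and systematic sampling only achieve this on average.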
Sampling Conclusion
Total Data
SRSWOR
Systematic
Stratified